
[SPARK-18206][ML] Add instrumentation for MLP, NB, LDA, AFT, GLM, Isotonic, LiR #15671

Closed
wants to merge 21 commits into master from zhengruifeng:lir_instr

Conversation

@zhengruifeng (Contributor) commented Oct 28, 2016

What changes were proposed in this pull request?

Add instrumentation for MLP, NB, LDA, AFT, GLM, Isotonic, and LiR.

How was this patch tested?

Local tests in spark-shell.
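For anyone reproducing this locally, a minimal check in spark-shell could look like the sketch below (the data and column names are only illustrative; the exact log text depends on the Instrumentation prefix and on having INFO logging enabled):

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression

val df = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0))
)).toDF("label", "features")

// With INFO logging on, fit() should print the instrumentation output:
// the logged params followed by a training-finished message.
val model = new LinearRegression().setMaxIter(5).fit(df)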

@SparkQA commented Oct 28, 2016

Test build #67700 has finished for PR 15671 at commit 181e1b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 28, 2016

Test build #67703 has finished for PR 15671 at commit ad707d2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 28, 2016

Test build #67704 has finished for PR 15671 at commit 6fabd70.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 28, 2016

Test build #67706 has finished for PR 15671 at commit 7c74241.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -96,11 +96,13 @@ class DecisionTreeClassifier @Since("1.4.0") (

val instr = Instrumentation.create(this, oldDataset)
instr.logParams(params: _*)
instr.logNumClasses(numClasses)
Contributor

These are already logged inside of RandomForest.run

@zhengruifeng (Contributor, Author)

@sethah Thanks. I have reverted those changes in the tree algorithms.

@SparkQA commented Oct 29, 2016

Test build #67741 has finished for PR 15671 at commit 05e2f4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor, Author)

@jkbradley @yanboliang Could you please review?

@sethah (Contributor) commented Oct 31, 2016

Minor: would you mind changing the title to "Add instrumentation logs to ML training algorithms"? I initially thought this PR was adding them to the old MLlib API, so the new title is a bit clearer.

@sethah (Contributor) left a comment

Left mostly minor comments. Let's create a JIRA for other meta-algorithms like CrossValidator, TrainValidationSplit, and OneVsRest. Thanks for doing this!

@@ -234,8 +234,14 @@ class MultilayerPerceptronClassifier @Since("1.5.0") (
* @return Fitted model
*/
override protected def train(dataset: Dataset[_]): MultilayerPerceptronClassificationModel = {
val instr = Instrumentation.create(this, dataset)
instr.logParams(params : _*)
Contributor

In general, we are trying to log all params that are useful, but not params that could overload the logs. Since MLP has an initialWeights parameter that could potentially be very large, we should not include it here.
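For example, explicit selection in MultilayerPerceptronClassifier.train could look roughly like this (a sketch only; the exact param list is up for discussion):

val instr = Instrumentation.create(this, dataset)
// Log only small, human-readable params; deliberately omit initialWeights,
// which can hold a very large vector.
instr.logParams(labelCol, featuresCol, predictionCol, layers, maxIter,
  tol, blockSize, solver, stepSize, seed)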

@zhengruifeng (Contributor, Author)

OK, I will update it here and in the other algorithms that support an initialModel.

@@ -216,7 +221,10 @@ class NaiveBayes @Since("1.5.0") (

val pi = Vectors.dense(piArray)
val theta = new DenseMatrix(numLabels, numFeatures, thetaArray, true)
new NaiveBayesModel(uid, pi, theta).setOldLabels(labelArray)
val model = new NaiveBayesModel(uid, pi, theta).setOldLabels(labelArray)

Contributor

nit: remove empty line

@@ -896,6 +896,10 @@ class LDA @Since("1.6.0") (
@Since("2.0.0")
override def fit(dataset: Dataset[_]): LDAModel = {
transformSchema(dataset.schema, logging = true)

val instr = Instrumentation.create(this, dataset)
instr.logParams(params : _*)
Contributor

Likewise here we probably should not log docConcentration.

copyValues(newModel).setParent(this)

val model = copyValues(newModel).setParent(this)
instr.logSuccess(newModel)
Contributor

minor: instr.logSuccess(model)

@@ -226,6 +226,10 @@ class AFTSurvivalRegression @Since("1.6.0") (@Since("1.6.0") override val uid: S
val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
val numFeatures = featuresStd.size

val instr = Instrumentation.create(this, dataset)
instr.logParams(params : _*)
Contributor

Maybe we should not log quantileProbabilities? I am not familiar with this algorithm, so I'm not sure whether these can ever be too large.

@@ -276,6 +280,7 @@ class AFTSurvivalRegression @Since("1.6.0") (@Since("1.6.0") override val uid: S
val intercept = parameters(1)
val scale = math.exp(parameters(0))
val model = new AFTSurvivalRegressionModel(uid, coefficients, intercept, scale)
instr.logSuccess(model)
Contributor

minor: move the logSuccess call after the copyValues. That way, if we ever do something in logSuccess, like logging parameters of the model, the call will reflect the copied parameters.
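That is, the suggested ordering is roughly (a sketch, following the variable names in the diff above):

// Copy estimator params onto the model first, then log success,
// so logSuccess sees the fully-populated model.
val model = copyValues(
  new AFTSurvivalRegressionModel(uid, coefficients, intercept, scale).setParent(this))
instr.logSuccess(model)
model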

Contributor

Also, would you mind changing the doc in Instrumentation.logSuccess from:

"Logs the successful completion of the training session and the value of the learned model."

to

"Logs the successful completion of the training session."

Since we currently don't do anything other than log a string inside logSuccess.

@@ -284,6 +288,7 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
.setParent(this))
val trainingSummary = new GeneralizedLinearRegressionTrainingSummary(dataset, model,
irlsModel.diagInvAtWA.toArray, irlsModel.numIterations, getSolver)
instr.logSuccess(model)
model.setSummary(trainingSummary)
Contributor

minor: move logSuccess after setSummary

@@ -284,6 +288,7 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
.setParent(this))
val trainingSummary = new GeneralizedLinearRegressionTrainingSummary(dataset, model,
irlsModel.diagInvAtWA.toArray, irlsModel.numIterations, getSolver)
instr.logSuccess(model)
Contributor

We need to log success if we enter the WeightedLeastSquares branch above as well. Maybe we can refactor the code to use if else instead of an if with a return. That way we only need to log success in one place.

@@ -392,6 +399,8 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
model,
Array(0D),
objectiveHistory)

instr.logSuccess(model)
Contributor

minor: move logSuccess after setSummary, here and elsewhere.

@zhengruifeng changed the title from "[SPARK-14567][ML]Add instrumentation logs to MLlib training algorithms" to "[SPARK-14567][ML]Add instrumentation logs to ML training algorithms" on Nov 1, 2016
@zhengruifeng (Contributor, Author)

@sethah Thanks for your review. I have made changes according to your comments, and I will create JIRAs for the meta-algorithms.

@SparkQA commented Nov 1, 2016

Test build #67878 has finished for PR 15671 at commit df8734e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley (Member)

I just made a subtask for the algs covered here. Can you please link this PR to SPARK-18206 in the title instead of the umbrella task?

@sethah (Contributor) commented Nov 1, 2016

So, if we consistently use instr.logParams(params: _*) (even in cases where it's acceptable) then we run the risk of adding some param in the future that could "overload" the logs (like initialModel). However, if we manually select the appropriate params to log, then we risk adding some other param in the future which we do want to log, but it never gets added. Both could be problematic.

For now, I think I lean towards manually selecting which params to log rather than logging all params. If we add more params later we will have to remember to add them to the logging. What are others' thoughts?

Alternatively, we could use a filter on certain types of params - like Array[_]. Or we could pass in specific params to exclude in the function call.
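For what it's worth, the type-filter idea could look something like this (a rough sketch; logFilteredParams is a hypothetical helper, and it assumes the caller lives inside the spark namespace since Instrumentation is private[spark]):

import org.apache.spark.ml.param.{Param, Params}
import org.apache.spark.ml.util.Instrumentation

// Hypothetical helper: log every explicitly-set param except array-valued ones
// (e.g. initialWeights, docConcentration), which could overload the logs.
def logFilteredParams(instr: Instrumentation[_], estimator: Params): Unit = {
  estimator.params.foreach { p =>
    estimator.get(p) match {
      case Some(_: Array[_]) => // skip potentially huge array-valued params
      case Some(_) => instr.logParams(p)
      case None => // param not set; nothing to log
    }
  }
}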

@jkbradley (Member)

Good point. I agree with you about logging selected Params only.

@zhengruifeng (Contributor, Author)

@sethah @jkbradley OK, I will link this PR to the subtask and change the algorithms that use instr.logParams(params: _*) to explicitly select some params.

@zhengruifeng changed the title from "[SPARK-14567][ML]Add instrumentation logs to ML training algorithms" to "[SPARK-18206][ML]Add instrumentation logs to ML training algorithms" on Nov 2, 2016
@SparkQA commented Nov 2, 2016

Test build #67947 has finished for PR 15671 at commit 6d2d13f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor) commented Nov 2, 2016

IMO, the way we're doing this logging right now is unsustainable. It requires too much manual work. We can leave this discussion for a different JIRA, but what we could do is modify the Instrumentation class to just truncate the param value string after a certain number of characters. Then, we could even modify Predictor.fit to create Instrumentation and log all params. The only thing we'd have to do in the individual algorithms is log anything else - algorithm specific - that we want to add. I haven't tested this yet. @jkbradley @zhengruifeng what do you think?
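A minimal sketch of the truncation part (a hypothetical helper; see the JSON concern raised further down before adopting it):

// Hypothetical: cap the rendered param value at maxLen characters before logging.
def truncatedValue(value: Any, maxLen: Int = 256): String = {
  val s = value.toString
  if (s.length <= maxLen) s else s.take(maxLen) + "... (truncated)"
}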

@sethah (Contributor) left a comment

A few minor comments, otherwise LGTM. Still, we definitely need to find a way to make this more sustainable.

return model.setSummary(trainingSummary)
model.setSummary(trainingSummary)
instr.logSuccess(model)
return model
Contributor

would you mind rewriting this if statement to:

val model = if (familyObj == Gaussian && linkObj == Identity) {
  ...
} else {
  ...
}
instr.logSuccess(model)
model

Then we can avoid the code duplication?

@zhengruifeng (Contributor, Author)

OK, I will change this.

@@ -227,7 +227,8 @@ class AFTSurvivalRegression @Since("1.6.0") (@Since("1.6.0") override val uid: S
val numFeatures = featuresStd.size

val instr = Instrumentation.create(this, dataset)
instr.logParams(params : _*)
instr.logParams(labelCol, featuresCol, censorCol, predictionCol, quantilesCol,
Contributor

add aggregationDepth

@zhengruifeng (Contributor, Author)

OK, thanks.

@@ -196,7 +196,8 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
}

val instr = Instrumentation.create(this, dataset)
instr.logParams(params : _*)
instr.logParams(labelCol, featuresCol, weightCol, predictionCol, solver, tol,
Contributor

add aggregationDepth

@zhengruifeng (Contributor, Author)

@sethah I have made changes according to the comments. Thanks for your review.

@zhengruifeng (Contributor, Author)

@sethah I agree that manually listing the params to log is error-prone. I think we can log all params except those labeled do-not-log in the individual algorithms. Or we could add a new method, def logParams(params: Seq[Param], except: Seq[Param]).
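A sketch of that second option, as an overload on Instrumentation (the body is illustrative; it just delegates to the existing varargs logParams):

// Hypothetical overload: log all of `params` except those listed in `except`.
def logParams(params: Seq[Param[_]], except: Seq[Param[_]]): Unit = {
  val excluded = except.map(_.name).toSet
  logParams(params.filterNot(p => excluded.contains(p.name)): _*)
}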

@SparkQA commented Nov 3, 2016

Test build #68039 has finished for PR 15671 at commit e77bdc4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor) commented Nov 3, 2016

I created SPARK-18253 to track it. We may have to get to it after the 2.1 QA period.

@jkbradley (Member)

I don't want to truncate Param strings because it would create invalid JSON in case people want to try to catch and parse the logs. I like the idea of allowing exceptions and possibly adding unit tests to ensure the logs do not blow up.
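To illustrate the concern: a JSON-encoded array param that gets cut off by a length cap is no longer parseable at all (illustrative value):

// A JSON-encoded array param value, cut off mid-array by a length cap:
val truncated = """{"quantileProbabilities":[0.01,0.05,0.1,0.25,0."""
// Any JSON parser (e.g. org.json4s.jackson.JsonMethods.parse) rejects this,
// so downstream tooling that expects one JSON document per log line would break.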

@zhengruifeng (Contributor, Author)

@jkbradley Updated according to your comments, including adding quantileProbabilities and docConcentration.

@SparkQA commented Jan 5, 2017

Test build #70901 has finished for PR 15671 at commit c8693d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 9, 2017

Test build #71064 has finished for PR 15671 at commit e6b4615.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor, Author)

ping @jkbradley

@jkbradley (Member)

Thanks for the updates!

For docConcentration and quantileProbabilities, I agree it could be problematic if these are too large. How about:

  • We don't log docConcentration since that could easily be large and since it is less likely to cause errors which users will interpret as bugs.
  • We log quantileProbabilities.size since a large array could cause problems.
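Concretely, that could look something like the following in AFTSurvivalRegression.train (a sketch; it assumes Instrumentation exposes a logNamedValue(name, value) helper, otherwise an equivalent log line would do):

val instr = Instrumentation.create(this, dataset)
// Log the scalar AFT params, but only the *size* of quantileProbabilities,
// since the array itself can be arbitrarily large.
instr.logParams(labelCol, featuresCol, censorCol, predictionCol, quantilesCol,
  fitIntercept, maxIter, tol, aggregationDepth)
instr.logNamedValue("quantileProbabilities.size", $(quantileProbabilities).length.toString)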

@SparkQA commented Jan 13, 2017

Test build #71285 has finished for PR 15671 at commit c8188b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor, Author)

@jkbradley Updated. Thanks for reviewing!

@zhengruifeng (Contributor, Author)

re-ping @jkbradley

@jkbradley (Member)

LGTM pending one more run of the tests. Thanks a lot, @zhengruifeng!

Jenkins test this please

@SparkQA commented Jan 17, 2017

Test build #3537 has finished for PR 15671 at commit c8188b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley (Member)

Merging with master
Thanks!

@asfgit closed this in e7f982b on Jan 17, 2017
@zhengruifeng deleted the lir_instr branch on January 18, 2017, 01:42
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017